Guarantee missing stream promise delivery #12207
Conversation
Thanks for the PR. Can you fix the failing unit tests?
Force-pushed from 470738a to be78878
@kannanjgithub tests pass locally with modifications (not sure if they're suitable given the logic switch), but the checks don't seem to be rerunning.
@kannanjgithub the only failure in the Linux artifacts for Kokoro seems to be a content issue with Apache's hosting when fetched via curl (I retried the command locally and it passed).
It seems clear we need a unit test that triggers this, because we'd be very likely to break it again. I don't actually see what case is missing in the current code. The comment seems to be talking about a case that already has a test covering it. So do things just need a slight additional tweak to trigger the breakage?

grpc-java/netty/src/test/java/io/grpc/netty/NettyClientHandlerTest.java Lines 261 to 273 in a8de9f0
This does seem like a good line of investigation. Note that I've only skimmed this as of yet, so I could be misunderstanding.
@ejona86 I'm out of my depth here, so I apologize if this doesn't make any sense.
Based on your ask and what we've observed here, you're looking for a test that exhibits this lack of notification on the listener (where Netty misses this delivery)? If so, doesn't there need to be a listener registered at a higher level than just the handler (one we assert was not called before delivering the promise failure)?
The state of the connection looks suspect. This is debugging output for each of my outstanding calls that indicates we're stuck:
For all of the stuck streams, the id is greater than the highest active stream. These active streams don't seem to be doing anything, but their count matches the number of stuck calls I have. All of them 'may have existed', for whatever that means. Looking at DefaultHttp2Connection.java in netty, I don't see any guard against concurrent modification of these sequence ids, streamMap, or activeStreams. Maybe I'm missing some serialization that prevents any threaded interaction with connection().local(), but otherwise I don't understand how this is not far more prevalent.
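As a rough aid to the state being described here, the fields in question can be read through Netty's public Http2Connection API. The sketch below is only illustrative (the class name, logging format, and the idea of passing in a stuck stream id are assumptions, not code from the reproducer):

```java
import io.netty.handler.codec.http2.Http2Connection;
import io.netty.handler.codec.http2.Http2Exception;
import io.netty.handler.codec.http2.Http2Stream;

final class ConnectionStateDump {
  /** Prints the parts of the HTTP/2 connection state discussed above. */
  static void dump(Http2Connection connection, int stuckStreamId) throws Http2Exception {
    // Highest stream id the local endpoint has created, and how many streams are active.
    System.out.println("lastStreamCreated=" + connection.local().lastStreamCreated()
        + " numActiveStreams=" + connection.numActiveStreams());

    // For a stuck RPC: is its stream still known to the connection at all?
    // The "may have existed" wording above maps to streamMayHaveExisted().
    Http2Stream stuck = connection.stream(stuckStreamId);
    System.out.println("stream(" + stuckStreamId + ")=" + stuck
        + " mayHaveExisted=" + connection.streamMayHaveExisted(stuckStreamId));

    // Walk whatever streams Netty still considers active.
    connection.forEachActiveStream(active -> {
      System.out.println("active id=" + active.id() + " state=" + active.state());
      return true; // continue iterating
    });
  }
}
```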
I think the only case where
That guarantees that in the case of a bug the RPC will still become closed, and it means we'll likely see a bug report to investigate such a case if it happens, to figure out what went wrong. And it won't be a hang; hangs are horrible to debug. But that also means I don't think this fixes the issue you are hitting.

In the comment, "a stream buffered in the encoder" should be referring to Netty's StreamBufferingEncoder. I'll note that

The RST_STREAM I think is referring to (note that it calls

grpc-java/netty/src/main/java/io/grpc/netty/NettyClientHandler.java Lines 788 to 795 in be78878
The
So no bug there. Looking more at StreamBufferingEncoder, and the places it removes streams from
So no bug there.
@ejona86
I still need to stare at the connection state more, but the model for Netty is that all the state changes happen on a single thread: the event loop. There are multiple threads (the "event loop group"), but when a connection is created it chooses one and uses it for its lifetime. A single event loop can handle multiple connections. So there's no need for synchronization. However, callback ordering can get pretty nasty; we have definitely seen problems in the past with callbacks being executed in a bad order, where some call is made directly without popping up the stack first (e.g., reentrancy).
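For readers unfamiliar with that model, here is a minimal sketch of the discipline being described, assuming `channel` is an already-established Netty `Channel` (the helper class and method names are made up for illustration):

```java
import io.netty.channel.Channel;

final class EventLoopDiscipline {
  /** Runs a connection-state mutation on the channel's single event loop. */
  static void touchConnectionState(Channel channel, Runnable stateMutation) {
    if (channel.eventLoop().inEventLoop()) {
      // Already on this connection's loop: run in place. This is where the
      // reentrancy hazards mentioned above come from -- we may already be
      // inside another callback on the same stack.
      stateMutation.run();
    } else {
      // Hop onto the loop; tasks run later, in submission order, with no locking.
      channel.eventLoop().execute(stateMutation);
    }
  }
}
```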
So you can reproduce this at will. Nice. Does this PR fix the hang?
Yes, it does in all cases. I haven't proven that the expected null stream case ever occurs where we DO receive the CANCELLED (as implied/intended by the unchanged implementation), but this overreach in terms of exception delivery guarantees that we fall back to what becomes a retry (and eventually resolves).
I think I found it. I'll have to think a bit about how to fix it. This PR can definitely go in because it would prevent the RPCs from hanging forever, but I do think we should be using INTERNAL, because this was indeed a bug.

I realized I missed mentioning the case where the stream was created but then was killed before the callback was called. I considered it, but any further writes would have their callbacks called in the appropriate order. And the RPC isn't known to the remote yet, so there shouldn't be any RST_STREAM.

But I missed GOAWAY and purposefully closing the stream, coupled with the write being buffered by Netty core I/O, not by StreamBufferingEncoder. If the stream is created but the receiving peer (the server) is a bit slower, then the HEADERS will be enqueued waiting to be sent. There are two cases I see:
Both of those "races" can only happen to the last few RPCs on a connection, as the HEADERS have to be buffered locally still. In both of these cases, we have a proper error, but just aren't communicating it to the stream. We want to avoid making a status in
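To make the intended guarantee concrete, here is a minimal, self-contained sketch of the promise handling this PR is aiming for. It is not the actual NettyClientHandler code; `HeadersWriteGuarantee`, `headersWritePromise`, and `rpcPromise` are illustrative stand-ins for the handler's real plumbing, and the status and description are placeholders:

```java
import io.grpc.Status;
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelPromise;
import io.netty.handler.codec.http2.Http2Connection;

final class HeadersWriteGuarantee {
  static void attach(ChannelPromise headersWritePromise, Http2Connection connection,
      int streamId, ChannelPromise rpcPromise) {
    headersWritePromise.addListener((ChannelFutureListener) future -> {
      if (!future.isSuccess()) {
        // The HEADERS write itself failed: propagate that cause.
        rpcPromise.setFailure(future.cause());
        return;
      }
      if (connection.stream(streamId) == null) {
        // The write "succeeded" but the connection no longer knows the stream
        // (e.g. it was closed by goingAway() while the write sat buffered).
        // Fail the caller-visible promise instead of assuming a CANCELLED will
        // arrive later, so the RPC cannot hang past its deadline.
        rpcPromise.setFailure(Status.INTERNAL
            .withDescription("stream " + streamId + " unknown to connection after headers write")
            .asRuntimeException());
      } else {
        rpcPromise.setSuccess();
      }
    });
  }
}
```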
Sounds good; the offer still stands to add any debug logging/tracing to my at-will reproducer to solve this the right way. Just to confirm: I change CANCELLED to INTERNAL and you're a green stamp? Do you want the exception creation outside of
@werkt, this should fix what you are seeing. It isn't a full fix, because it doesn't work if the stream was buffered initially by StreamBufferingEncoder and then later by the I/O subsystem when the GOAWAY was received. But I doubt you are seeing that case.

diff --git a/netty/src/main/java/io/grpc/netty/NettyClientHandler.java b/netty/src/main/java/io/grpc/netty/NettyClientHandler.java
index a5fa0f800..b455180bb 100644
--- a/netty/src/main/java/io/grpc/netty/NettyClientHandler.java
+++ b/netty/src/main/java/io/grpc/netty/NettyClientHandler.java
@@ -768,6 +768,10 @@ class NettyClientHandler extends AbstractNettyHandler {
}
}
});
+ Http2Stream http2Stream = connection().stream(streamId);
+ if (http2Stream != null) {
+ http2Stream.setProperty(streamKey, stream);
+ }
}
/**

That is being run immediately after the
Yes. That is strictly better while also not ignoring the fact that there is a bug in the code. But you'll probably start noticing the failure. My small patch should fix that for you (but not in all cases). We can get both of those fixes in our next release (originally scheduled for last week, but now looking like this week).
Really, the normal "proper" thing is to call

But I'm quite willing to pay the exception creation if it avoids an RPC hang. This is apparently pretty rare for most people, too, because I think this bug has always existed in grpc-java. (Aren't you lucky to find such a bug!) I expect your client and server aren't in the same datacenter; you're going over a slower link of some sort, or just pushing a lot of bytes, or the server is under some CPU pressure. With that in mind, I can understand how many people wouldn't see this, because it requires the race of sending an RPC when receiving a GOAWAY while the TCP connection is fully buffered.
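As an aside on the exception-creation cost mentioned here: if it ever mattered, one generic Java option is an exception type that skips capturing a stack trace, since fillInStackTrace() is usually the expensive part. This is only a sketch of that technique, not a claim about what grpc-java does:

```java
/** Illustrative exception that skips stack-trace capture to keep creation cheap. */
final class StreamGoneException extends RuntimeException {
  StreamGoneException(String message) {
    // enableSuppression=false, writableStackTrace=false: no stack is recorded.
    super(message, null, false, false);
  }
}
```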
@ejona86 Yahtzee. Your change alone also fixes the hang, preliminarily. I'm going to run my overnight test to confirm for sure (I set one up for the previous fix). Let me know what you want to do here; happy to close this assuming it turns out the same. Yes, this is definitely against a server with high latency (22ms) and limited connectivity (1Gbps from the client), at extreme concurrency (~100 streams/channel) with nearly full link saturation and CPU utilization. The server is well equipped for this, and actually goes through nginx to reach Java again on the receive side.
I'd hope to still get this PR in. I'm doing my other half in #12222; this will be the back-up for when things still go wrong (as we know they still can).
In observed cases, whether RST_STREAM or another failure from netty or the server, listeners can fail to be notified when a connection yields a null stream for the selected streamId. This causes hangs in clients, despite deadlines, with no obvious resolution. Tests which relied upon this promise succeeding must now change.
Force-pushed from be78878 to 2a766c7
So be it. I changed CANCELLED to INTERNAL. Tests will still fail in this condition, so I'm leaving them alone.
@ejona86 I'm unable to get the Kokoro tests to run locally; does the test failure make sense to you?
They were just flakes. I restarted it earlier and it seems to be passing now. The Java 17 failure is also a flake.
In grpc#12185, RPCs were randomly hanging. In grpc#12207 this was tracked down to the headers promise completing successfully, but the netty stream was null. This was because the headers write hadn't completed but stream.close() had been called by goingAway().
In observed cases, whether RST_STREAM or another failure from netty or the server, listeners can fail to be notified when a connection yields a null stream for the selected streamId. This causes hangs in clients, despite deadlines, with no obvious resolution.
This is not simply a race between Netty delivering a result that would be interpreted as a failure and the setSuccess previously implemented; the Netty layer does not report the stream as failed.
Fixes #12185